Task 1.

PD

\begin{gather*} f( x_{1} ,x_{2}) =( x_{1} +x_{2})^{2}\\ g_{1}^{PD}( z) =E_{X_{2}}( z+X_{2})^{2} =z^{2} +2z\,E_{X_{2}} X_{2} +E_{X_{2}} X_{2}^{2} =z^{2} +\frac{1}{3} \end{gather*}

(In PD we take the expectation over the marginal distribution of $\displaystyle X_{2}$, independently of $\displaystyle z$; for $\displaystyle X_{2} \sim U[ -1,1]$ we have $\displaystyle E_{X_{2}} X_{2} =0$ and $\displaystyle E_{X_{2}} X_{2}^{2} =\frac{1}{3}$.)

MP

\begin{array}{l} g_{1}^{MP}( z) =E_{x_{2} |x_{1} =z}( x_{1} +x_{2})^{2} =\\ ( z+z)^{2} =4z^{2} \end{array}

(In MP the expectation is conditional on $\displaystyle x_{1} =z$; with $\displaystyle x_{2} =x_{1}$ it reduces to $\displaystyle ( 2z)^{2}$.)

ALE

\begin{array}{l} g_{1}^{AL}( z) =\int _{-1}^{z} E_{x_{2} |x_{1} =v}\frac{\partial ( x_{1} +x_{2})^{2}}{\partial x_{1}} dv=\\ \int _{-1}^{z} E_{x_{2} |x_{1} =v} 2( x_{1} +x_{2}) dv=\\ \int _{-1}^{z} 4v\,dv=\\ \left[ 2v^{2}\right]_{-1}^{z} =\\ 2z^{2} -2 \end{array}
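As a quick numeric sanity check of the ALE integral, we can integrate the integrand $4v$ over $[-1, z]$ directly (this assumes, as in the derivation, that $\displaystyle E_{x_{2} |x_{1} =v}\, 2( x_{1} +x_{2}) =4v$ and that $x_{1}$ ranges over $[-1, 1]$):

```python
import numpy as np

# Numeric check of the ALE integral: the integrand E[2(x1 + x2) | x1 = v] = 4v.
def ale_numeric(z, n=100_000):
    # Midpoint rule on [-1, z]; exact (up to rounding) for a linear integrand.
    edges = np.linspace(-1.0, z, n + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    dv = edges[1] - edges[0]
    return float(np.sum(4.0 * mid) * dv)

# The integral of 4v from -1 to z evaluates to 2z^2 - 2:
print(ale_numeric(0.5))  # close to 2*(0.5)**2 - 2 = -1.5
print(ale_numeric(1.0))  # close to 2*(1.0)**2 - 2 = 0.0
```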

Task 2.

As in the previous assignment, we will be working with the "MagicTelescope" dataset. This dataset is designed to mimic the detection of high-energy gamma particles. The 'TARGET' column in this dataset contains two types of values:

  • 'g' represents gamma (signal), which we will label as 0,
  • 'h' stands for hadron (background), which we will assign the label 1.

We'll first use the XGBoost model, and later the CatBoost model for comparison.
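The label encoding described above can be sketched as follows (the short Series here is a toy stand-in; the real notebook loads the full dataset, which is not shown in this excerpt):

```python
import pandas as pd

# Toy stand-in for the MagicTelescope 'TARGET' column.
target = pd.Series(["g", "h", "g", "g", "h"], name="TARGET")

# 'g' (gamma / signal) -> 0, 'h' (hadron / background) -> 1
labels = target.map({"g": 0, "h": 1})
print(labels.tolist())  # -> [0, 1, 0, 0, 1]
```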

Let me remind you that the dataset looks like this:

Out[314]:
fLength: fWidth: fSize: fConc: fConc1: fAsym: fM3Long: fM3Trans: fAlpha: fDist:
0 28.7967 16.0021 2.6449 0.3918 0.1982 27.7004 22.0110 -8.2027 40.0920 81.8828
1 31.6036 11.7235 2.5185 0.5303 0.3773 26.2722 23.8238 -9.9574 6.3609 205.2610
2 162.0520 136.0310 4.0612 0.0374 0.0187 116.7410 -64.8580 -45.2160 76.9600 256.7880
3 23.8172 9.5728 2.3385 0.6147 0.3922 27.2107 -6.4633 -7.1513 10.4490 116.7370
4 75.1362 30.9205 3.1611 0.3168 0.1832 -5.5277 28.5525 21.8393 4.6480 356.4620

We'll train the XGBClassifier model with the default parameters.

Out[315]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)

We can check its accuracy on the test set:

Accuracy: 0.8772344900105152
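The accuracy check can be sketched like this (the labels below are toy stand-ins; the real call compares `model.predict(X_test)` against `y_test`):

```python
from sklearn.metrics import accuracy_score

# Toy stand-ins for y_test and the model's predictions on X_test.
y_true = [0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1, 1, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))  # -> Accuracy: 0.875
```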

For the rest of the task, we'll mainly focus on a randomly selected 100-element subset of the test set.
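The subsetting step could look like this (the toy frame and the fixed seed are assumptions; the real call samples the actual test set):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the test set.
X_test = pd.DataFrame({"fDist:": np.arange(500, dtype=float)})

# Randomly select 100 rows; the random_state value is an assumption.
sample = X_test.sample(n=100, random_state=42)
print(len(sample))  # -> 100
```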

Let's create our own function to calculate Ceteris Paribus explanations.
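A minimal version of such a function might look like this (the function name, signature, and default grid size are my own; the notebook's actual implementation is not shown):

```python
import numpy as np
import pandas as pd

def ceteris_paribus(predict_fn, X, row_idx, feature, grid=None):
    """What-if profile for one observation: vary `feature` over `grid`
    while keeping every other feature of the row fixed."""
    if grid is None:
        grid = np.linspace(X[feature].min(), X[feature].max(), 50)
    row = X.loc[[row_idx]]
    preds = []
    for v in grid:
        what_if = row.copy()
        what_if[feature] = v
        preds.append(float(predict_fn(what_if)[0]))
    return pd.DataFrame({feature: grid, "prediction": preds})
```

Plotting the grid against the predictions, plus a red dot at the row's original feature value, reproduces the profiles described below.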

For each of the first 10 rows of the sampled set, we'll plot the Ceteris Paribus profile for the feature fDist:. The red dots represent the original values of the feature, while each line shows the "what-if" predictions as the value of fDist: varies.

We can see for example that observations 2503 and 13317 have a different profile for this feature. Let's plot just the two of them below.

Out[330]:
fLength: fWidth: fSize: fConc: fConc1: fAsym: fM3Long: fM3Trans: fAlpha: fDist: TARGET
2503 47.8108 20.5476 3.0445 0.1913 0.1015 57.6978 -19.8689 7.6107 8.4572 196.3960 0
13317 23.6796 13.4887 2.5212 0.4583 0.2455 -40.0835 -18.0034 6.6805 49.3332 127.5187 1

We can see that fDist: is moderately correlated with several other features:

Out[359]:
fLength:     0.418466
fWidth:      0.336816
fSize:       0.437041
fConc:      -0.328332
fConc1:     -0.304625
fAsym:      -0.206730
fM3Long:     0.037025
fM3Trans:    0.011427
fAlpha:     -0.220556
fDist:       1.000000
Name: fDist:, dtype: float64
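The column of correlations above is presumably the fDist: column of the full correlation matrix; on toy data with a shared latent factor:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real features: both columns share a latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=300)
X = pd.DataFrame({
    "fSize:": latent + rng.normal(size=300),
    "fDist:": latent + 0.5 * rng.normal(size=300),
})

# Correlation of every feature with fDist: (one column of X.corr()).
corr_with_fdist = X.corr()["fDist:"]
print(round(corr_with_fdist["fDist:"], 6))  # -> 1.0 (by definition)
```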

There's also substantial correlation between features in general, as seen in this correlation heatmap:

Out[363]:
[Correlation heatmap of all features]

I've tried to see what happens when we set the values of the columns highly correlated with fDist: (I set the threshold at absolute correlation $\geq 0.4$) to their mean values from the entire dataset.
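The experiment could be sketched as follows (the function name and its interface are my own; the notebook's code is not shown):

```python
import pandas as pd

def neutralize_correlated(row, X, feature, threshold=0.4):
    """Return a copy of `row` where every column whose absolute correlation
    with `feature` is >= `threshold` is replaced by its dataset-wide mean."""
    corr = X.corr()[feature].drop(feature)
    high = corr[corr.abs() >= threshold].index
    out = row.copy()
    out[high] = X[high].mean()
    return out

# Toy example: b is perfectly correlated with a, c is uncorrelated.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                  "b": [2.0, 4.0, 6.0, 8.0],
                  "c": [1.0, -1.0, -1.0, 1.0]})
print(neutralize_correlated(X.loc[0], X, "a").to_dict())
# -> {'a': 1.0, 'b': 5.0, 'c': 1.0}
```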

I was surprised by how flat the line for observation 13317 was. It probably means that the main variability in the prediction along fDist: for that observation was not due to fDist: itself, but rather to the highly correlated features fLength: and fSize:.

Let's now train the CatBoost model.

Out[394]:
<catboost.core.CatBoostClassifier at 0x2b6038e50>

We can now compare the PDP profiles for the XGBoost and CatBoost models (also with a custom-built function).

[PDP profiles of fDist: for the XGBoost and CatBoost models]
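A minimal custom PDP function, in the same spirit as the Ceteris Paribus one (the name and signature are my own):

```python
import numpy as np
import pandas as pd

def partial_dependence(predict_fn, X, feature, grid=None):
    """PDP: for each grid value, set `feature` to that value in every
    row of `X` and average the model's predictions."""
    if grid is None:
        grid = np.linspace(X[feature].min(), X[feature].max(), 50)
    means = []
    for v in grid:
        X_mod = X.copy()
        X_mod[feature] = v
        means.append(float(np.mean(predict_fn(X_mod))))
    return pd.DataFrame({feature: grid, "pd": means})
```

Applied to each model's prediction function on the 100-row sample with the feature fDist:, this yields one averaged profile per model; averaging the per-row Ceteris Paribus profiles gives the same curve.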

As we can see, they are remarkably similar. They also share many similarities with the first Ceteris Paribus plot for the XGBoost model; for example, both exhibit a "spike" between 200 and 300. Note that the first Ceteris Paribus plot covered only the first 10 rows of the random sample, while the PDP plots use all 100 rows.

The CatBoost plot is also much smoother, probably due to the higher tree depth I set for the CatBoost model. This way there are many smaller "jumps" in the CatBoost PDP plot, compared to fewer, bigger "jumps" in the XGBoost plot, which creates the illusion of a smooth curve.